- Harsh Maheshkumar Pandya
- Rushank Ghanshyam
Sheta
- Shubhkirti Prasad
The main goal of our EDA is to know and understand the distribution of Video Game sales data over the years in different continents. Through the years there have been many Console releases; Sony PlayStation, Nintendo Wii and Xbox 360 being some of the most famous ones.
Our dataset contains a list of video games with sales greater than 100,000 copies. It was orginally generated by a scrape of vgchartz.com. For our analysis, the dataset was taken directly from Kaggle.
Fields include
Rank - Ranking of overall sales
Name - The games name
Platform - Platform of the games release (i.e. PC,PS4, etc.)
Year - Year of the game’s release
Genre - Genre of the game
Publisher - Publisher of the game
NA_Sales - Sales in North America (in millions)
EU_Sales - Sales in Europe (in millions)
JP_Sales - Sales in Japan (in millions)
Other_Sales - Sales in the rest of the world (in millions)
Global_Sales - Total worldwide sales.
#install.packages("dplyr")
#install.packages("tidyverse")
#install.packages("gapminder")
#install.packages("GGally")
#install.packages("arm")
#install.packages("vcd")
#install.packages("devtools")
#install.packages("ade4")
#install.packages("rlang")
#install.packages("viridis")
#install.packages("plyr")
library("plyr")
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
library(stringr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange() masks plyr::arrange()
## ✖ purrr::compact() masks plyr::compact()
## ✖ dplyr::count() masks plyr::count()
## ✖ dplyr::desc() masks plyr::desc()
## ✖ tidyr::extract() masks magrittr::extract()
## ✖ dplyr::failwith() masks plyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::id() masks plyr::id()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::mutate() masks plyr::mutate()
## ✖ dplyr::rename() masks plyr::rename()
## ✖ purrr::set_names() masks magrittr::set_names()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gapminder)
library(broom)
library(viridis)
## Loading required package: viridisLite
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(arm)
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
##
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loading required package: lme4
##
## arm (Version 1.13-1, built: 2022-8-25)
##
## Working directory is /Users/rushank/My Drive/mac_gdrive/Exploratory Data Analysis/Project/5847734
library(vcd)
## Loading required package: grid
library(devtools)
## Loading required package: usethis
library(ade4)
library(rlang)
##
## Attaching package: 'rlang'
##
## The following objects are masked from 'package:purrr':
##
## %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
##
## The following object is masked from 'package:magrittr':
##
## set_names
#install_github("vqv/ggbiplot")
library(ggbiplot)
## Loading required package: scales
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:arm':
##
## rescale
##
## The following object is masked from 'package:viridis':
##
## viridis_pal
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
#devtools::install_github("Ryo-N7/tvthemes")
invisible(theme_set(theme_bw()) + theme_update(plot.title = element_text(hjust = 0.5)))
number_ticks = function(n) {function(limits) pretty(limits, n)}
sales_data = read.csv('vgsales.csv')
dim(sales_data)
## [1] 16598 11
head(sales_data)
Rank <int> | Name <chr> | Platform <chr> | Year <chr> | Genre <chr> | Publisher <chr> | NA_Sales <dbl> | EU_Sales <dbl> | JP_Sales <dbl> | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Wii Sports | Wii | 2006 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | |
| 2 | 2 | Super Mario Bros. | NES | 1985 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | |
| 3 | 3 | Mario Kart Wii | Wii | 2008 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | |
| 4 | 4 | Wii Sports Resort | Wii | 2009 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | |
| 5 | 5 | Pokemon Red/Pokemon Blue | GB | 1996 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | |
| 6 | 6 | Tetris | GB | 1989 | Puzzle | Nintendo | 23.20 | 2.26 | 4.22 |
str(sales_data)
## 'data.frame': 16598 obs. of 11 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
## $ Platform : chr "Wii" "NES" "Wii" "Wii" ...
## $ Year : chr "2006" "1985" "2008" "2009" ...
## $ Genre : chr "Sports" "Platform" "Racing" "Sports" ...
## $ Publisher : chr "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
## $ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
## $ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
## $ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
summary(sales_data)
## Rank Name Platform Year
## Min. : 1 Length:16598 Length:16598 Length:16598
## 1st Qu.: 4151 Class :character Class :character Class :character
## Median : 8300 Mode :character Mode :character Mode :character
## Mean : 8301
## 3rd Qu.:12450
## Max. :16600
## Genre Publisher NA_Sales EU_Sales
## Length:16598 Length:16598 Min. : 0.0000 Min. : 0.0000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000
## Mode :character Mode :character Median : 0.0800 Median : 0.0200
## Mean : 0.2647 Mean : 0.1467
## 3rd Qu.: 0.2400 3rd Qu.: 0.1100
## Max. :41.4900 Max. :29.0200
## JP_Sales Other_Sales Global_Sales
## Min. : 0.00000 Min. : 0.00000 Min. : 0.0100
## 1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0600
## Median : 0.00000 Median : 0.01000 Median : 0.1700
## Mean : 0.07778 Mean : 0.04806 Mean : 0.5374
## 3rd Qu.: 0.04000 3rd Qu.: 0.04000 3rd Qu.: 0.4700
## Max. :10.22000 Max. :10.57000 Max. :82.7400
#converting year to a numeric column
sales_data$Year = as.numeric(sales_data$Year,na.rm = TRUE)
## Warning: NAs introduced by coercion
unique(sales_data$Year)
## [1] 2006 1985 2008 2009 1996 1989 1984 2005 1999 2007 2010 2013 2004 1990 1988
## [16] 2002 2001 2011 1998 2015 2012 2014 1992 1997 1993 1994 1982 2003 1986 2000
## [31] NA 1995 2016 1991 1981 1987 1980 1983 2020 2017
na_indices = is.na(sales_data)
na_values = sales_data[na_indices]
na_values
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [151] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [176] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [201] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [226] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [251] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# removing all the NA values
sales_data = na.omit(sales_data)
sales_data2 = data.frame(sales_data)
sales_data2[ , c("Platform", "Year", "Genre", "Publisher")] <- lapply(sales_data2[ , c("Platform", "Year", "Genre", "Publisher")], as.factor)
summary(sales_data2)
## Rank Name Platform Year
## Min. : 1 Length:16327 DS :2133 2009 :1431
## 1st Qu.: 4136 Class :character PS2 :2127 2008 :1428
## Median : 8295 Mode :character PS3 :1304 2010 :1259
## Mean : 8293 Wii :1290 2007 :1202
## 3rd Qu.:12442 X360 :1235 2011 :1139
## Max. :16600 PSP :1197 2006 :1008
## (Other):7041 (Other):8860
## Genre Publisher NA_Sales
## Action :3253 Electronic Arts : 1339 Min. : 0.0000
## Sports :2304 Activision : 966 1st Qu.: 0.0000
## Misc :1710 Namco Bandai Games : 928 Median : 0.0800
## Role-Playing:1471 Ubisoft : 918 Mean : 0.2654
## Shooter :1282 Konami Digital Entertainment: 823 3rd Qu.: 0.2400
## Adventure :1276 THQ : 712 Max. :41.4900
## (Other) :5031 (Other) :10641
## EU_Sales JP_Sales Other_Sales Global_Sales
## Min. : 0.0000 Min. : 0.00000 Min. : 0.00000 Min. : 0.0100
## 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0600
## Median : 0.0200 Median : 0.00000 Median : 0.01000 Median : 0.1700
## Mean : 0.1476 Mean : 0.07866 Mean : 0.04832 Mean : 0.5402
## 3rd Qu.: 0.1100 3rd Qu.: 0.04000 3rd Qu.: 0.04000 3rd Qu.: 0.4800
## Max. :29.0200 Max. :10.22000 Max. :10.57000 Max. :82.7400
##
ggplot(sales_data, aes(x=Global_Sales)) + geom_histogram(fill='navy', color='lightblue') + ggtitle('Global Sales - Histogram (bins=50)')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(sales_data, aes(x=Global_Sales)) + geom_histogram(bins=100, fill='navy', color='lightblue') + ggtitle('Lowest Global Sales - Histogram (bins=100)') + xlim(0, 5)
## Warning: Removed 204 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 2 rows containing missing values (`geom_bar()`).
# log transformation on Global Sales
ggplot(sales_data) + geom_histogram(aes(log(Global_Sales)), fill='navy', color='lightblue')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Inference:
From the above histogram plots, we can see that
the data is highly skewed and mostof the games have sales less than 1
million dollars, hence while fitting the models we should Log transform
the column.
# checking out the Year column
sales_data %>%
group_by(Year) %>%
count() %>%
ggplot() +
geom_density(aes(Year)) + scale_x_continuous(breaks=number_ticks(12)) + ggtitle('Global Sales by Year')
# Sales by platform :
sales_data %>% group_by(Platform) %>%
dplyr::summarise(sales = sum(Global_Sales)) %>% ggplot() +
geom_bar(aes(reorder(Platform, sales), sales), stat = "identity",
fill = "steelblue") +
xlab("Platform") + ylab("Global sales") + scale_y_continuous(breaks=number_ticks(5)) + ggtitle('Global Sales by Platform') +
coord_flip()
# combining the platforms together to get a better representation
sales_data = sales_data %>%
mutate(platform2 = case_when(
Platform %in% c("Wii", "DS", "3DS", "WiiU", "GC", "GB","GBA") ~ "Nintendo",
Platform %in% c("X360", "XB", "XOne") ~ "XBox",
Platform %in% c("PS3", "PS4", "PS2", "PS", "PSP", "PSV") ~ "PS",
Platform == "PC" ~ "PC",
Platform == "DC" ~ "Sega"
))
sales_data %>% group_by(platform2) %>%
dplyr::summarise(sales = sum(Global_Sales)) %>% ggplot() +
geom_bar(aes(reorder(platform2, sales), sales), stat = "identity",
fill = "steelblue") +
xlab("Platform") + ylab("Global sales") + scale_y_continuous(breaks=number_ticks(5)) + ggtitle('Global Sales for Most Popular Platforms') +
coord_flip()
Inference:
From the above barplots we can observe the top
3 most popular platforms(most sales over the years) for playing games
are PS2, XBox 360 and PS3 respectively. But when we group by the
manufacturer of the consoles and count the aggregate sales, we find that
Play Station is the top performer while XBox has lowe sales than
Nintendo’s consoles, and Sega has the lowest of all.
ggplot(sales_data, aes(x=Genre, y=Global_Sales, fill=Genre, color=Genre)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(size=10, angle=90), legend.position = 'bottom', aspect.ratio=10/15, legend.key.size = unit(0.3, 'cm')) +
ggtitle('Overall Sales by Genre')
# Count occurrences of each value in the Col1 column
value_counts <- table(sales_data$Publisher)
# Extract values with a count of 1
values_with_count_1 <- names(value_counts[value_counts > 411])
# Filter the dataframe based on values with count 1 in Col1 column
filtered_df <- sales_data %>%
filter(Publisher %in% values_with_count_1)
ggplot(filtered_df, aes(x=Publisher, y=Global_Sales, fill=Publisher)) + theme(legend.position="bottom", axis.text.y=element_text(size=8, angle=45), legend.key.size = unit(0.3, 'cm')) +
geom_bar(stat="identity") + ggtitle("Overall Sales by Publisher") + coord_flip()
Inference:
From the above two barplots, we can observe that
the Genres Action, Shooter and Sports contribute higer to the total
sales while lowest purchases games belongs to genres like Puzzle,
Strategy and Adventure.
And Publishers like Nintendo, Electronic
Arts and Activision seems to be the market leaders in terms of Total
sales over all of the years.
# Select only numeric columns
num_cols = sapply(sales_data, is.numeric)
df_num = sales_data[, num_cols]
# Create ggpairs for numeric columns
ggpairs(df_num, title='Pair Plot')
Inference:
As anticipated, [NA, EU, JP, Other]Sales are
highly correlated with Global Sales and each other which suggests that
if a newly launched game has comparatively higher sales than its
competitors in one region, then launching the same game in other regions
might result in comparatively similar trend probablistically.
Year
is not significantly correlated with any other attribute
color <- c("Titles released" = "red", "Global sales" = "blue")
sales_data %>%
group_by(Year) %>%
dplyr::summarise(sales = sum(Global_Sales), count = n())%>%
ggplot() + geom_line(aes(Year, count, group = 1, color = "Titles released")) +
geom_line(aes(Year, sales, group = 1, color = "Global sales")) +
ylab("Titles released") +
scale_color_manual(values = color) + labs(color = "") + scale_x_continuous(breaks=number_ticks(12)) + ggtitle('Titles Released and Sales by Year')
Inference:
# Log transforming the sales column to check the relationship
ggplot(sales_data, aes(x = Year, y = log10(Global_Sales))) + geom_point()
colnames(sales_data)
## [1] "Rank" "Name" "Platform" "Year" "Genre"
## [6] "Publisher" "NA_Sales" "EU_Sales" "JP_Sales" "Other_Sales"
## [11] "Global_Sales" "platform2"
table(sales_data$Year)
##
## 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## 9 46 36 17 14 14 21 16 15 17 16 41 43 60 121 219
## 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
## 263 289 379 338 349 482 829 775 763 941 1008 1202 1428 1431 1259 1139
## 2012 2013 2014 2015 2016 2017 2020
## 657 546 582 614 344 3 1
#sales_data[sales_data$Year > 2016, ]
# Since there is not enough data past 2016, we shall limit our analysis only till 2016
sales_data = sales_data[sales_data$Year < 2017, ]
sales_by_year = aggregate(cbind(NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales) ~ Year, data = sales_data, sum)
Line-Plots of the sales columns :
# Area comparison
ggplot(sales_by_year, aes(x = Year)) +
geom_line(aes(y = NA_Sales, color = "NA_Sales")) +
geom_line(aes(y = EU_Sales, color = "EU_Sales")) +
geom_line(aes(y = JP_Sales, color = "JP_Sales")) +
geom_line(aes(y = Other_Sales, color = "Other_Sales")) +
geom_line(aes(y = Global_Sales, color = "Global_Sales")) +
labs(x = "Year", y = "Sales", color = "Sales Type") +
ggtitle("Sales by Year") +
scale_color_manual(values = c("NA_Sales" = "red", "EU_Sales" = "blue",
"JP_Sales" = "green", "Other_Sales" = "orange",
"Global_Sales" = "black")) + theme(legend.position = 'bottom', aspect.ratio=10/15) + scale_x_continuous(breaks=number_ticks(12))
# Reshape data to long format
sales_by_year_long <- reshape2::melt(sales_by_year, id.vars = "Year", variable.name = "Sales_Location", value.name = "Sales")
#head(sales_by_year_long)
sales_by_year_short <- sales_by_year_long[sales_by_year_long$Sales_Location != "Global_Sales", ]
ggplot(data = sales_by_year_short) +
geom_line(aes(x = Year, y = Sales, colour = Sales_Location, group=Sales_Location)) + scale_x_continuous(breaks=number_ticks(12)) +
ggtitle('Reginal Sales Over the Years')
sales_data %>% group_by(platform2, Year) %>%
dplyr::summarise(sales = sum(Global_Sales)) %>%
ggplot() +
geom_line(aes(Year, sales, group = platform2, color = platform2)) +
xlab("Year of release") + ylab("Global Sales") + labs(color = "") + ggtitle('Global Sales for each Platform')
## `summarise()` has grouped output by 'platform2'. You can override using the
## `.groups` argument.
# Top 3 best sellers overall
global_sales_sorted <- sales_data[order(-sales_data$Global_Sales), ]
top_global_sales <- global_sales_sorted[1:3, c("Name", "Platform", "Global_Sales")]
print(top_global_sales)
## Name Platform Global_Sales
## 1 Wii Sports Wii 82.74
## 2 Super Mario Bros. NES 40.24
## 3 Mario Kart Wii Wii 35.82
# NA Subset and sort
na_sales <- sales_data[!is.na(sales_data$NA_Sales), ]
na_sales_sorted <- na_sales[order(-na_sales$NA_Sales), ]
# top 10 with highest sales
top_na_sales <- na_sales_sorted[1:10, c("Name", "Platform", "NA_Sales")]
t_v_name_NA <- aggregate(list(NA_Sales = sales_data$NA_Sales), list(Name = sales_data$Name), sum)
t_v_name_NA <- t_v_name_NA[order(t_v_name_NA$NA_Sales, decreasing = T), ]
ggplot(data = top_na_sales, aes(x = Name, y = NA_Sales, fill = Name)) +
geom_bar(stat = "identity", color='white') +
labs(x = "Game Title", y = "NA Sales (in millions)",
title = "Top 10 Best Sellers in North America") +
theme(axis.text.y=element_text(size=7, angle=45), legend.position = 'none') + coord_flip()
# NA Subset and sort
na_sales <- sales_data[!is.na(sales_data$NA_Sales), ]
na_sales_sorted <- na_sales[order(-na_sales$NA_Sales), ]
# top 10 with highest sales
top_na_sales <- na_sales_sorted[1:10, c("Name", "Platform", "NA_Sales")]
t_v_name_NA <- aggregate(list(NA_Sales = sales_data$NA_Sales), list(Name = sales_data$Name), sum)
t_v_name_NA <- t_v_name_NA[order(t_v_name_NA$NA_Sales, decreasing = T), ]
ggplot(data = top_na_sales, aes(x = Name, y = NA_Sales, fill = Platform)) +
geom_bar(stat = "identity", color='white') +
labs(x = "Game Title", y = "NA Sales (in millions)",
title = "Top 10 Best Sellers in North America") +
theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()
# EU Subset and sort
eu_sales <- sales_data[!is.na(sales_data$EU_Sales), ]
eu_sales_sorted <- eu_sales[order(-eu_sales$EU_Sales), ]
# top 10 with highest sales
top_eu_sales <- eu_sales_sorted[1:10, c("Name", "Platform", "EU_Sales")]
ggplot(data = top_eu_sales, aes(x = Name, y = EU_Sales, fill = Platform)) +
geom_bar(stat = "identity", color='white') +
labs(x = "Game Title", y = "EU Sales (in millions)",
title = "Top 10 Best Sellers in Europe") +
theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()
# JP Subset and sort
jp_sales <- sales_data[!is.na(sales_data$JP_Sales), ]
jp_sales_sorted <- jp_sales[order(-jp_sales$JP_Sales), ]
# top 10 with highest sales
top_jp_sales <- jp_sales_sorted[1:10, c("Name", "Platform", "JP_Sales")]
ggplot(data = top_jp_sales, aes(x = Name, y = JP_Sales, fill = Platform)) +
geom_bar(stat = "identity", color='white') +
labs(x = "Game Title", y = "JP Sales (in millions)",
title = "Top 10 Best Sellers in Japan") +
theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()
# JP Subset and sort
o_sales <- sales_data[!is.na(sales_data$Other_Sales), ]
o_sales_sorted <- o_sales[order(-o_sales$Other_Sales), ]
# top 10 with highest sales
top_o_sales <- o_sales_sorted[1:10, c("Name", "Platform", "Other_Sales")]
ggplot(data = top_o_sales, aes(x = Name, y = Other_Sales, fill = Platform)) +
geom_bar(stat = "identity", color='white') +
labs(x = "Game Title", y = "Other Sales (in millions)",
title = "Top 10 Best Sellers in Other Regions") +
theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()
# NA Subset and sort
na_sales <- sales_data[!is.na(sales_data$NA_Sales), ]
na_sales_sorted <- na_sales[order(-na_sales$NA_Sales), ]
# top 10 with highest sales
top_na_sales <- na_sales_sorted[1:10, c("Name", "Platform", "NA_Sales", "Genre")]
t_v_name_NA <- aggregate(list(NA_Sales = sales_data$NA_Sales), list(Name = sales_data$Name), sum)
t_v_name_NA <- t_v_name_NA[order(t_v_name_NA$NA_Sales, decreasing = T), ]
ggplot(data = top_na_sales, aes(x = Name, y = NA_Sales, fill = Genre)) +
geom_bar(stat = "identity", color='white') +
labs(x = "Game Title", y = "NA Sales (in millions)",
title = "Top 10 Best Sellers in North America ~ Genre") +
theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()
# EU Subset and sort
eu_sales <- sales_data[!is.na(sales_data$EU_Sales), ]
eu_sales_sorted <- eu_sales[order(-eu_sales$EU_Sales), ]
# top 10 with highest sales
top_eu_sales <- eu_sales_sorted[1:10, c("Name", "Platform", "EU_Sales", "Genre")]
ggplot(data = top_eu_sales, aes(x = Name, y = EU_Sales, fill = Genre)) +
geom_bar(stat = "identity", color='white') +
labs(x = "Game Title", y = "EU Sales (in millions)",
title = "Top 10 Best Sellers in Europe ~ Genre") +
theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()
# JP Subset and sort
jp_sales <- sales_data[!is.na(sales_data$JP_Sales), ]
jp_sales_sorted <- jp_sales[order(-jp_sales$JP_Sales), ]
# top 10 with highest sales
top_jp_sales <- jp_sales_sorted[1:10, c("Name", "Platform", "JP_Sales", "Genre")]
ggplot(data = top_jp_sales, aes(x = Name, y = JP_Sales, fill = Genre)) +
geom_bar(stat = "identity", color='white') +
labs(x = "Game Title", y = "JP Sales (in millions)",
title = "Top 10 Best Sellers in Japan ~ Genre") +
theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()
# JP Subset and sort
o_sales <- sales_data[!is.na(sales_data$Other_Sales), ]
o_sales_sorted <- o_sales[order(-o_sales$Other_Sales), ]
# top 10 with highest sales
top_o_sales <- o_sales_sorted[1:10, c("Name", "Platform", "Other_Sales", "Genre")]
ggplot(data = top_o_sales, aes(x = Name, y = Other_Sales, fill = Genre)) +
geom_bar(stat = "identity", color='white') +
labs(x = "Game Title", y = "Other Sales (in millions)",
title = "Top 10 Best Sellers in Other Regions ~ Genre") +
theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()
years = unique(sales_data$Year)
df_top_p = data.frame(Year = integer(),
Publisher = character(),
Global_Sales = numeric(),
stringsAsFactors = FALSE)
for (year in years) {
# games released for current year
year_sales <- subset(sales_data, Year == year)
# Aggregate sales
publisher_sales <- aggregate(Global_Sales ~ Publisher, year_sales, sum)
# sory by descending and take the top
top_publisher <- publisher_sales[order(-publisher_sales$Global_Sales), "Publisher"][1]
# add into newdf
df_top_p <- rbind(df_top_p, data.frame(Year = year, Publisher = top_publisher, Global_Sales = publisher_sales[publisher_sales$Publisher == top_publisher, "Global_Sales"]))
}
ggplot(data = df_top_p, aes(x = Year, y = Global_Sales, fill = Publisher)) +
geom_col(position = "dodge") +
labs(x = "Year", y = "Global Sales (in millions)",
title = "Top Publisher by Global Sales for Each Year") +
theme(axis.text.x=element_text(size=8, angle=90), legend.position = 'bottom', legend.key.size = unit(0.2, 'cm')) + scale_x_continuous(breaks=number_ticks(21))
Inference:
As we can see from the above plot, Nintendo
was the top Publisher for most of the years. How ever we can see a trend
where Electronic Arts started selling more games than Nintendo from year
2002 until 2005 after which again Nintendo was top publisher. And lastly
from the year 2010 Electronic Art’s sales were more than Nintendo.
Hence We are particularly interested in analyzing data from years
2005-2009 and 2010-2014 which will help us understand how Electronic
Arts was able to beat it competitor Nintendo.
#head(filtered_df)
mid5y = filtered_df %>% filter(Year>=2005, Year<=2009)
last5y = filtered_df %>% filter(Year>=2010, Year<=2014)
last5y
Rank <int> | Name <chr> | Platform <chr> | Year <dbl> | Genre <chr> | |
|---|---|---|---|---|---|
| 17 | Grand Theft Auto V | PS3 | 2013 | Action | |
| 24 | Grand Theft Auto V | X360 | 2013 | Action | |
| 27 | Pokemon Black/Pokemon White | DS | 2010 | Role-Playing | |
| 30 | Call of Duty: Modern Warfare 3 | X360 | 2011 | Shooter | |
| 32 | Call of Duty: Black Ops | X360 | 2010 | Shooter | |
| 33 | Pokemon X/Pokemon Y | 3DS | 2013 | Role-Playing | |
| 35 | Call of Duty: Black Ops II | PS3 | 2012 | Shooter | |
| 36 | Call of Duty: Black Ops II | X360 | 2012 | Shooter | |
| 38 | Call of Duty: Modern Warfare 3 | PS3 | 2011 | Shooter | |
| 41 | Call of Duty: Black Ops | PS3 | 2010 | Shooter |
ggplot(mid5y, aes(x=Year, y=Global_Sales))+
geom_bar(stat='identity', fill="navy")+
facet_wrap(~Publisher, nrow=2) + ggtitle('Sales from year 2005-2009 ~ Publisher') + theme(axis.text.x=element_text(size=9, angle=90))
ggplot(filter(mid5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo'), aes(x=Year, y=Global_Sales))+
geom_bar(stat='identity', fill="navy")+
facet_wrap(~Publisher, nrow=1) + ggtitle('Sales from 2005-2009 ~ Publisher') + theme(axis.text.x=element_text(size=8, angle=90))
ggplot(last5y, aes(x=Year, y=Global_Sales))+
geom_bar(stat='identity', fill="navy")+
facet_wrap(~Publisher, nrow=2) + ggtitle('Sales from year 2010-2014 ~ Publisher') + theme(axis.text.x=element_text(size=9, angle=90))
ggplot(filter(last5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo'), aes(x=Year, y=Global_Sales))+
geom_bar(stat='identity', fill="navy")+
facet_wrap(~Publisher, nrow=1) + ggtitle('Sales from 2010-2014 ~ Publisher') + theme(axis.text.x=element_text(size=8, angle=90))
Inference:
From the above two faceted plots, we can see
that Nintendo, Electronic Arts and Acticision are having the best sales
In year 2005-2009 Nintendo sales were all time high about 200
million and averaged around 130 million over the period of 5 years,
while Electronic Arts had average sales of 75 million followed by
Activision with 60 million in sales
However, from year 2010
Nintendo’s sales dropped to average 55 million and Electronic Art’s
average sales were highest around 60 million followed by Activision with
average of 50 million.
In summary, we saw that Nintendo was having
highest sales in mid-5-years while in the last-5-years Electronic Arts
crossed Nintendo in average sales and Activision was also able to
increase the sales comparatively.
We are interested in finding the
probable reason why Nintendo’s sales dropped by around 50% while sales
for Electronic Arts and Activision increased.
count_last5y <- last5y %>%
group_by(Publisher, Genre, Year) %>%
dplyr::summarise(count = n())
## `summarise()` has grouped output by 'Publisher', 'Genre'. You can override
## using the `.groups` argument.
count_last5y
Publisher <chr> | Genre <chr> | Year <dbl> | count <int> | |
|---|---|---|---|---|
| Activision | Action | 2010 | 19 | |
| Activision | Action | 2011 | 24 | |
| Activision | Action | 2012 | 29 | |
| Activision | Action | 2013 | 8 | |
| Activision | Action | 2014 | 22 | |
| Activision | Adventure | 2010 | 2 | |
| Activision | Adventure | 2011 | 2 | |
| Activision | Adventure | 2014 | 3 | |
| Activision | Misc | 2010 | 8 | |
| Activision | Misc | 2011 | 13 |
count_mid5y <- mid5y %>%
group_by(Publisher, Genre, Year) %>%
dplyr::summarise(count = n())
## `summarise()` has grouped output by 'Publisher', 'Genre'. You can override
## using the `.groups` argument.
#filter(count_last5y, Publisher=='Activision')[,2:4]
ggplot(data = count_mid5y, aes(x=Genre, y=count, fill=Publisher)) +
geom_bar(stat="identity") +
facet_wrap('Year', nrow=1) + ylim(0,150) +
theme(axis.text.x=element_text(size=9, angle=90), legend.position = 'bottom', legend.key.size = unit(0.2, 'cm')) + ggtitle('Games Released for each Genre ~ mid-5-Years')
ggplot(data = count_last5y, aes(x=Genre, y=count, fill=Publisher)) +
geom_bar(stat="identity") +
facet_wrap('Year', nrow=1) + ylim(0,150) +
theme(axis.text.x=element_text(size=9, angle=90), legend.position = 'bottom', legend.key.size = unit(0.3, 'cm')) + ggtitle('Games Released for each Genre ~ last-5-Years')
Inference:
The above plots are hard to interpret but we
can spot a trend that the relatively higher number of games are
realeased for Action category
To better understand similar kind of
plots we will only consider publishers like “Nintendo”, “Activision” and
“Electronic Arts” since we spotted a pattern for these publishers in
previous plots for this set of years
count_mid5y <- mid5y %>%
group_by(Publisher, Genre, Year) %>%
dplyr::summarise(count = n())
## `summarise()` has grouped output by 'Publisher', 'Genre'. You can override
## using the `.groups` argument.
count_mid5y
Publisher <chr> | Genre <chr> | Year <dbl> | count <int> | |
|---|---|---|---|---|
| Activision | Action | 2005 | 22 | |
| Activision | Action | 2006 | 8 | |
| Activision | Action | 2007 | 22 | |
| Activision | Action | 2008 | 32 | |
| Activision | Action | 2009 | 44 | |
| Activision | Adventure | 2005 | 5 | |
| Activision | Adventure | 2007 | 5 | |
| Activision | Adventure | 2008 | 2 | |
| Activision | Adventure | 2009 | 4 | |
| Activision | Misc | 2005 | 4 |
ggplot(data = filter(count_mid5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo'), aes(x=Year, y=count, fill=Publisher, group=Publisher)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap('Genre', nrow=1) +
theme(axis.text.x=element_text(size=8, angle=90), legend.key.size = unit(0.2, 'cm')) + ggtitle('Games released by Publishers(2005-2009) ~ Genre') + ylim(0,40)
## Warning: Removed 5 rows containing missing values (`geom_bar()`).
m1 = filter(count_mid5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo')
m2 = filter(m1, Genre=='Action' | Genre=='Adventure' | Genre=='Puzzle' | Genre=='Shooter' | Genre=='Sports' | Genre=='Racing')
ggplot(data = filter(m2, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo'), aes(x=Year, y=count, fill=Publisher, group=Publisher)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap('Genre', nrow=1) +
theme(axis.text.x=element_text(size=8, angle=90), legend.key.size = unit(0.2, 'cm')) + ggtitle('Games released by Publishers(2005-2009) ~ Genre') + ylim(0,40)
## Warning: Removed 5 rows containing missing values (`geom_bar()`).
l1 = filter(count_last5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo')
l2 = filter(l1, Genre=='Action' | Genre=='Adventure' | Genre=='Puzzle' | Genre=='Shooter' | Genre=='Sports' | Genre=='Racing'| Genre=='Misc')
ggplot(data = l2, aes(x=Year, y=count, fill=Publisher, group=Publisher)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap('Genre', nrow=1) +
theme(axis.text.x=element_text(size=8, angle=90), legend.key.size = unit(0.2, 'cm')) + ggtitle('Games released by Publishers(2010-2014) ~ Genre') + ylim(0,40)
Inference:
We can see that grouped barplot is not a
good representation hence using stacked barplot to better understand the
trends for game relases over the same set of years
m1 = filter(count_mid5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo')
m2 = filter(m1, Genre=='Action' | Genre=='Puzzle' | Genre=='Shooter' | Genre=='Sports' | Genre=='Racing' | Genre=='Misc')
ggplot(data = m2, aes(x=Year, y=count, fill=Publisher)) +
geom_bar(stat="identity") +
facet_wrap('Genre', nrow=2) +
theme(axis.text.x=element_text(size=10, angle=90), legend.key.size = unit(0.2, 'cm'), legend.position = 'bottom') + ggtitle('Games released by Publishers(2005-2009) ~ Genre') + ylim(0,40)
## Warning: Removed 5 rows containing missing values (`position_stack()`).
## Warning: Removed 3 rows containing missing values (`geom_bar()`).
l1 = filter(count_last5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo')
l2 = filter(l1, Genre=='Action' | Genre=='Puzzle' | Genre=='Shooter' | Genre=='Sports' | Genre=='Racing'| Genre=='Misc')
ggplot(data = l2, aes(x=Year, y=count, fill=Publisher)) +
geom_bar(stat="identity") +
facet_wrap('Genre', nrow=2) +
theme(axis.text.x=element_text(size=10, angle=90), legend.key.size = unit(0.2, 'cm'), legend.position = 'bottom') + ggtitle('Games released by Publishers(2010-2014) ~ Genre') + ylim(0,40)
## Warning: Removed 3 rows containing missing values (`geom_bar()`).
sales_by_publisher = sales_data %>% group_by(Publisher) %>%
dplyr::summarise(sales = sum(Global_Sales)) %>%
arrange(desc(sales)) %>% slice(1:10)
top_publishers = head(sales_by_publisher$Publisher, n = 10)
filtered_df = sales_data %>%filter(Publisher %in% top_publishers)
sales_by_year_publisher = aggregate(Global_Sales ~ Publisher + Year, data = filtered_df, sum)
ggplot(sales_by_year_publisher, aes(x = Year, y = Global_Sales)) +
geom_point(alpha=0.75,size=0.5) +
geom_smooth(method = "loess", se = FALSE, span=0.75, color= 'blue', linewidth= 0.6) +
facet_wrap(~ Publisher, ncol = 5) +
labs(title = "Trend of Total Global Sales by Publisher and Year",
x = "Year",
y = "Total Global Sales") +
theme(axis.text.x=element_text(size=9))
## `geom_smooth()` using formula = 'y ~ x'
#geom_smooth(method = "lm",color= 'red', se =FALSE,linewidth= 0.8) +
sales_by_genre_year_publisher = aggregate(Global_Sales ~ Genre + Year, data = filtered_df, sum)
ggplot(sales_by_genre_year_publisher, aes(x = Year, y = Global_Sales)) +
geom_point(alpha=0.75,size=0.5) +
geom_smooth(method = "loess", se = FALSE, span=0.75, color= 'blue', linewidth= 0.6) +
facet_wrap(~ Genre, ncol = 4) +
labs(title = "Total Global Sales by Genre and Year",
x = "Year",
y = "Total Global Sales") +
theme(axis.text.x=element_text(size=9))
## `geom_smooth()` using formula = 'y ~ x'
sales_by_publisher = sales_data %>% group_by(Publisher) %>%
dplyr::summarise(sales = sum(Global_Sales)) %>%
arrange(desc(sales)) %>% slice(1:10)
top_publishers = head(sales_by_publisher$Publisher, n = 10)
filtered_df = sales_data %>%filter(Publisher %in% top_publishers)
sales_by_year_publisher = aggregate(Global_Sales ~ Publisher + Year, data = filtered_df, sum)
ggplot(sales_by_year_publisher, aes(x = Year, y = Global_Sales)) +
geom_point(alpha=0.75,size=0.5) +
geom_smooth(method = "loess", se = FALSE, span=0.75, color= 'blue', linewidth= 0.6) +
geom_smooth(method = "lm",color= 'red', se =FALSE,linewidth= 0.8) +
facet_wrap(~ Publisher, ncol = 5) +
labs(title = "Trend of Total Global Sales by Publisher and Year",
x = "Year",
y = "Total Global Sales") +
theme(axis.text.x=element_text(size=9))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
sales_by_genre_year_publisher = aggregate(Global_Sales ~ Genre + Year, data = filtered_df, sum)
ggplot(sales_by_genre_year_publisher, aes(x = Year, y = Global_Sales)) +
geom_point(alpha=0.75,size=0.5) +
geom_smooth(method = "loess", se = FALSE, span=0.75, color= 'blue', linewidth= 0.6) +
geom_smooth(method = "lm",color= 'red', se =FALSE,linewidth= 0.8) +
facet_wrap(~ Genre, ncol = 4) +
labs(title = "Total Global Sales by Genre and Year",
x = "Year",
y = "Total Global Sales") +
theme(axis.text.x=element_text(size=9))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Main aim:
The main aim behind our model is the check : -
If the sales of the games in Japan influenced by sales of the games in
any other part of the world. - How is does the sales in Japan contribute
to the overall global sales of the games.
filtered_lo = loess(JP_Sales ~ NA_Sales * Global_Sales, data = filtered_df, span = 1/3, parametric = "NA_Sales", drop.square = "NA_Sales", family = "symmetric")
prediction_grid = data.frame(expand.grid(NA_Sales = seq(0.00, 42, by = 6.9), Global_Sales = seq(0.00, 83, by = 8)))
filtered_preds = augment(filtered_lo, newdata = prediction_grid)
ggplot(filtered_preds) +
geom_line(aes(x = NA_Sales, y = .fitted)) +
facet_wrap(~ Global_Sales, ncol = 11) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
scale_y_continuous("Fitted value of JP_Sales")
## Warning: Removed 7 rows containing missing values (`geom_line()`).
NA_Sales and the predicted values of
JP_Sales based on the fitted loess model,
with facets for different values of
Global_Sales. The plot provides insights
into the nature of the relationship between the variables and how it
varies with different values of
Global_Salesggplot(filtered_preds) +
geom_line(aes(x = Global_Sales, y = .fitted,
color = factor(NA_Sales, levels = unique(NA_Sales), ordered = TRUE))) +
guides(color = guide_legend(title = "NA_Sales")) +
scale_y_continuous("Fitted value of JP_Sales")
## Warning: Removed 7 rows containing missing values (`geom_line()`).
filtered_lo = loess(JP_Sales ~ Global_Sales * NA_Sales, data = filtered_df, span = 1/3, parametric = "Global_Sales", drop.square = "Global_Sales", family = "symmetric")
prediction_grid = data.frame(expand.grid(Global_Sales = seq(0.00, 83, by = 6.9), NA_Sales = seq(0.00, 43, by = 8)))
filtered_preds = augment(filtered_lo, newdata = prediction_grid)
ggplot(filtered_preds) +
geom_line(aes(x = Global_Sales, y = .fitted)) +
facet_wrap(~ NA_Sales, ncol = 11) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
scale_y_continuous("Fitted value of JP_Sales")+
ggtitle('Interaction of Global_sales and JP_Sales faceted by NA_Sales')
## Warning: Removed 2 rows containing missing values (`geom_line()`).
ggplot(filtered_preds) +
geom_line(aes(x = NA_Sales, y = .fitted,
color = factor(Global_Sales, levels = unique(Global_Sales), ordered = TRUE))) +
guides(color = guide_legend(title = "Global_Sales")) +
scale_y_continuous("Fitted value of JP_Sales")+
ggtitle('Global_sales given and NA_Sales faceted by NA_Sales')
## Warning: Removed 12 rows containing missing values (`geom_line()`).
Predictive Models :
sales_data = read.csv('vgsales_scores.csv')
sales_data
Name <chr> | Platform <chr> | Year_of_Release <chr> | Genre <chr> | |
|---|---|---|---|---|
| Wii Sports | Wii | 2006 | Sports | |
| Super Mario Bros. | NES | 1985 | Platform | |
| Mario Kart Wii | Wii | 2008 | Racing | |
| Wii Sports Resort | Wii | 2009 | Sports | |
| Pokemon Red/Pokemon Blue | GB | 1996 | Role-Playing | |
| Tetris | GB | 1989 | Puzzle | |
| New Super Mario Bros. | DS | 2006 | Platform | |
| Wii Play | Wii | 2006 | Misc | |
| New Super Mario Bros. Wii | Wii | 2009 | Platform | |
| Duck Hunt | NES | 1984 | Shooter |
sales_data$Year = as.numeric(sales_data$Year,na.rm = TRUE)
## Warning: NAs introduced by coercion
sales_data = na.omit(sales_data)
summary(sales_data$Critic_Score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 62.00 72.00 70.26 80.00 98.00
summary(sales_data$Critic_Count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 14.00 24.00 28.84 39.00 113.00
summary(sales_data$User_Count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 11.0 27.0 174.4 89.0 10665.0
summary(sales_data$User_Score)
## Length Class Mode
## 6894 character character
sales_data$User_Score = as.numeric(sales_data$User_Score)
summary(sales_data$User_Score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.500 6.500 7.500 7.184 8.200 9.600
sales_data$User_Score = sales_data$User_Score * 10
colors = c("Critic score" = "blue", "User score" = "red")
ggplot(sales_data) +
# relating the critic_score to the global sales
geom_smooth(aes(Critic_Score, Global_Sales, color = "Critic score")) +
# relating the user_score to the global sales
geom_smooth(aes(User_Score, Global_Sales, color = "User score")) +
labs(color = "") + xlab("Score") + ylab("Global sales") +
scale_color_manual(values = colors)+
ggtitle('Global_sales given by User and Critic Score')
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
- Here we see that the critic score is a more influential factor to the
global sales.
colors = c("Critic score" = "blue", "User score" = "red")
ggplot(sales_data) +
# relating the critic_score to the global sales
geom_smooth(aes(Critic_Score, JP_Sales, color = "Critic score")) +
# relating the user_score to the global sales
geom_smooth(aes(User_Score, JP_Sales, color = "User score")) +
labs(color = "") + xlab("Score") + ylab("JP sales") +
scale_color_manual(values = colors)+
ggtitle('Japanese Sales given by User and Critic Score')
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
###Conclusions:
Overall we can draw alot of peculiar conclusions from our observations from our analysis and modeling:
Analysis conclusions:
Modeling conclusions :